1. Introduction: Aims and data


The findings to be presented in this paper were not anticipated, but came 
about as an unexpected result of looking at how the application of a version 
of the Levenshtein distance to word lists compares with cognate counting. 
We were interested in the degree to which the two correlate. The results 
of this investigation are intrinsically interesting and will be presented in 
the following section 2, but even more interesting is our finding that differ- 
ences between counting cognates and measuring the Levenshtein distances 
vary as a function of average word lengths in the word lists compared. This 
observation will occupy the remainder of the paper, with section 3 devoted 
to establishing the statistical significance of the observation across language 
families, while section 4 establishes the significance within language groups, 
and section 5 discusses competing explanations. First we briefly explain the 
specific version of the Levenshtein distance used and the concept of cognate 
identification. 

In numerous previous papers, beginning in Holman et al. (2008a), the
present authors as well as other members of the network of scholars partici- 
pating in the project known as ASJP (or Automated Similarity Judgment 
Program) have applied a computer-assisted comparison of word lists in order 
to derive a measure of differences among languages. Our method consists in 
comparing pairs of words to determine the Levenshtein distance, LD, which 
is defined as the number of substitutions, insertions, and deletions necessary
to transform one word into another. The LD is divided by the length of the 
longer of the two words compared such that any distance will come to lie 
in the range 0% to 100%. This normalized measure, called LDN, is averaged
over all pairs of words referring to the same concept in lists from two given 
languages. To enhance discrimination between related and unrelated lan- 
guages, this average LDN is further divided by the average LDN between 
words referring to different concepts in the different lists, to obtain what 
we call LDND ("Levenshtein Distance Normalized Divided"). A similarity
measure, here called ASJPsim, is defined by subtracting LDND from 100%. 
The higher performance of LDND in comparison to LDN for the purpose
of classifying languages is supported in Pompei et al. (2011) and Wichmann 
et al. (2010a), and Huff and Lonsdale (2011) report similar performances of 
LDND and the more linguistically informed but also much more computer- 
intensive ALINE algorithm of Kondrak (2000). Greenhill (2011) reports a 
low performance for LDN (not looking at LDND), but limits the investi- 
gation to the specific case of the Austronesian languages. 
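
The definitions of LD, LDN, LDND, and ASJPsim given above can be sketched in a few lines of Python. This is a minimal illustration, not the ASJP software itself: the function names are ours, each character is treated as one segment (an assumption that ignores multi-character ASJPcode symbols), and the different-concept comparison is restricted to concepts attested in both lists.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of substitutions, insertions, and deletions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def ldn(a: str, b: str) -> float:
    """LD divided by the length of the longer word (a proportion in 0..1)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def ldnd(list1: dict, list2: dict) -> float:
    """LDND for two word lists given as {concept: word} dictionaries."""
    shared = [c for c in list1 if c in list2]
    if len(shared) < 2:
        raise ValueError("need at least two shared concepts")
    # Average LDN over word pairs with the same meaning.
    same = [ldn(list1[c], list2[c]) for c in shared]
    # Average LDN over word pairs with different meanings (chance baseline).
    diff = [ldn(list1[c1], list2[c2])
            for c1 in shared for c2 in shared if c1 != c2]
    return (sum(same) / len(same)) / (sum(diff) / len(diff))


def asjp_sim(list1: dict, list2: dict) -> float:
    """ASJPsim in percent: 100% minus LDND."""
    return 100.0 * (1.0 - ldnd(list1, list2))
```

Because LDND is a ratio to a chance baseline, it can exceed 100% for unrelated lists, in which case this sketch returns a negative ASJPsim.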

LDND and ASJPsim have been put to various uses, such as the dating of 
proto-languages (Holman et al. 2011), the identification of geographical cen- 
ters of linguistic diversity for the purpose of identifying homelands (Wich- 
mann et al. 2010c), the estimation of the limitations of word list comparisons 
for identifying deep genealogical relationships (Wichmann et al. 2010b), and 
the study of the relationship between population sizes and language change 
rates (Wichmann and Holman 2009). As objective and easily-obtained mea- 
sures of the difference and similarity between any given pair of languages, 
LDND and ASJPsim are potentially useful for the investigation of possible 
correlations between languages and other kinds of data, such as data per- 
taining to human culture, prehistory, biology, and ecology. 

A different method of measuring similarities between languages is that 
of counting cognates (related words) on a fixed list of lexical concepts. The 
percentage of concepts for which the words are cognate in two given lan- 
guages is here called COGNsim. This method was developed within the 
framework of lexicostatistics (e.g., Swadesh 1955). In more recent years, 
cognate identification has been used to establish cognate classes as input 
to character-based phylogenetic methods, and a variety of issues have been 
explored using such methods, including dating and classification of language 
groups (Gray and Jordan 2000; Gray and Atkinson 2003), identification of 
factors that affect speed of lexical change (Pagel et al. 2007; Atkinson et al. 
2008), questions of homelands and language expansions (Gray et al. 2009; 
Walker and Ribeiro 2011), and relationships between the evolution of differ- 
ent cultural traits (cf. Mace and Jordan 2011 for a review). These studies have 
mostly been carried out in relation to the three largest groups of languages 
where word lists coded for cognacy are available: Indo-European (Dyen et 
al. 1992), Austronesian (Greenhill et al. 2008), and Bantu (Bastin et al. 1999). 

Identifying a cognate pair of words is not a trivial task. In the ideal situation
a full set of sound correspondences is available which will allow the
researcher to match up related words correctly, but the prior identification of 
regular sound correspondences requires hundreds, if not thousands, of sets 
of word comparisons. Such information is rarely available. Thus, it is more 
common to resort to some version of what Gudschinsky (1956: 615) calls the
"inspection method", which essentially amounts to educated guesses. 

The aim of this paper is to compare ASJPsim to COGNsim. ASJPsim is 
based on the 40 items identified by Holman et al. (2008b) as the most stable 
in Swadesh's (1955) 100-item lexicostatistical list. This can be compared to 
COGNsim at two different levels of resolution, depending on the type of 
data available. Many published studies present matrices of cognate percent- 
ages based on Swadesh's 100-item list, his earlier 200-item list (Swadesh 
1952) or a modification of one of these. A minority of studies additionally
provide word lists where each lexical item is identified as belonging to a 
given cognate class, thus allowing for a higher resolution of the comparison 
in the sense that ASJPsim can be compared to judgments of cognacy at the 
level of words. These comparisons are based on the items in the 40-item 
ASJP list that are also included in the lists used in the study. For simplicity, 
all calculations are based on the first synonym listed if the source includes 
more than one synonym for a concept. Loanwords identified as such in the 
source are omitted from the calculations. 
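
For sources that code each word for cognate class, the word-level computation of COGNsim described above might look as follows. The data layout (a dictionary from concept to a list of synonym entries with hypothetical keys "form", "cognate_class", and "loan") is our own assumption for illustration, as is the choice to skip a concept entirely when its first listed synonym is marked as a loan.

```python
def first_form(entries):
    """Return the first listed synonym, or None if absent or marked as a loan."""
    if not entries:
        return None
    entry = entries[0]
    return None if entry["loan"] else entry


def cogn_sim(lang1: dict, lang2: dict) -> float:
    """Percentage of comparable concepts whose words share a cognate class."""
    comparable = 0
    cognate = 0
    for concept, entries in lang1.items():
        f1 = first_form(entries)
        f2 = first_form(lang2.get(concept, []))
        if f1 is None or f2 is None:
            continue  # concept missing or a loanword in one of the two lists
        comparable += 1
        if f1["cognate_class"] == f2["cognate_class"]:
            cognate += 1
    if comparable == 0:
        raise ValueError("no comparable concepts")
    return 100.0 * cognate / comparable
```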

Table 1 provides an overview of data and sources. These have been found 
by a search of pertinent literature. Undoubtedly more could be added, but the 
sample is sufficiently large and has a sufficient spread in terms of geography 
and genealogies (language families) that it allows us to test the statistical 
significance of observations made. Language family designations from 
Ethnologue (Lewis 2009) are followed by the abbreviations that we will 
use in later tables. We follow the sources in naming the different language 
groups. See the legend after the table for abbreviations of language group
names. These names will be used throughout this paper to identify a dataset 
from a specific source. For instance, in the context of references to data we 
use “Austronesian” for a small set of languages whose cognate percentages 
are given in Dyen (1965) rather than for the family as a whole. The appendix 
shows how we match languages in the sources with word lists in the ASJP 
database (Wichmann et al. 2012). We additionally provide a checkmark (✓)
following the reference when word lists encoded for cognate classes were
available, and finally we provide the number of languages (N) within each
group for which data were available for both COGNsim and ASJPsim. The 
total number of languages sampled amounts to around 8% of the world's 
languages by the definition of Lewis (2009). The sample includes 24 fami- 
lies from all world areas, with no major skewing: Africa (3), Eurasia and 
SE Asia (7), the Pacific (6), North America (3), South America (3), Middle 
America (2). 

Table 1. Overview of sources and the nature of the data


Family            Abb.  Group         Source                                          N
Afro-Asiatic      AA    Afras         Militarev (2000) ✓                              18
                        Cushitic      Bender (1971)                                   28
                        EthSem        Bender (1971)                                   13
                        Omotic        Bender (1971)                                   21
Altaic            Alt   Turkic        Troike (1969)                                   6
Australian        Aus   Daly          Tryon (1974)                                    13
                        Iwaidjan      R. Mailhammer (p.c., 2011)                      3
                        Mayi          Breen (1981)                                    5
                        Paman         Sommer (1976)                                   3
                        WAustr        O'Grady (1966)                                  7
                        WBarkly       Chadwick (1979)                                 3
                        Worrorran     McGregor & Rumsey (2009)                        9
Austro-Asiatic    AuA   MoKh          Peiros (1998) ✓                                 16
Austronesian      An    Austr         Dyen (1965)                                     10
                        Malagasy      Vérin et al. (1969)                             18
                        Melan         Z'Graggen (1969)                                6
                        Morob         Hooley (1971) ✓                                 55
                        NHebr         Tryon (1973)                                    16
                        Philip        Llamzon (1976)                                  72
                        Yapen         Anceaux (1961)                                  18
Carib             Car   Cariban       Villalon (1991)                                 10
Dravidian         Dra   Dravidian     Andronov (2001)                                 10
Hmong-Mien        HM    MiaoY         Peiros (1998) ✓                                 6
Indo-European     IE    IndEur        Dyen et al. (1992)                              55
Japonic           Jap   Japonic       Hattori (1961)                                  5
Mayan             May   Mayan         C. H. Brown (p.c., 2011) ✓                      30
Macro-Ge          MGe   Ge            Wilbert (1962)                                  9
Mixe-Zoque        MZ    MiZo          Cysouw et al. (2006) ✓                          10
Na-Dene           NDe   Athap         Hoijer (1956) ✓                                 15
Niger-Congo       NC    Atlantic      Sapir (1971)                                    21
                        Benue-Congo   Bennett & Sterk (1977)                          22
                        Gur           Swadesh et al. (1966)                           20
                        Kwa           Heine (1968)
Nilo-Saharan      NS    ESud          Thelwall (1981)                                 12
                        NilSah        Bender (1971)                                   23
                        Southern Luo  Blount & Curley (1970)                          5
Quechuan          Que   Quechua       Torero (1970)                                   9
Salishan          Sal   Salish        Swadesh (1950)                                  21
Sino-Tibetan      ST    Chinese       Xu (1991)                                       6
                        LoloB         Peiros (1998) ✓                                 15
                        SinTib        Benedict (1976)                                 7
Tai-Kadai         TK    Kadai         Peiros (1998) ✓                                 11
Torricelli        Tor   Kamas         Sanders & Sanders (1980) ✓                      7
Trans-New Guinea  TNG   Angan         Lloyd (1973)                                    12
                        Awyu          Voorhoeve (1968)                                6
                        Bosavi        Shaw (1986)                                     22
                        Eleman        Brown (1973)                                    8
                        Finisterre    Claasen & McElhanon (1970)                      12
                        GVDani        Bromley (1967)                                  7
                        GrMad         Z'Graggen (1969)                                50
                        Huon          McElhanon (1967) ✓                              14
                        Kiwaian       Wurm (1973)                                     8
                        Koiarian      Dutton (1969)                                   6
                        Kolopom       Voorhoeve (1968)                                3
                        Ok            Voorhoeve (1968)                                5
                        TurKik        Franklin (1973)                                 4
Uto-Aztecan       UA    Uto-Aztecan   Miller (1984), Cortina-Borja & Valiñas (1989)   26
West Papuan       WP    NHalm         Chlenov (1986)                                  8
                        Yawa          Jones (1986)                                    6


Legend: Afras: Afrasian; EthSem: Ethiosemitic; WAustr: West Australian; WBarkly: West Barkly; 
MoKh: Mon-Khmer; Austr: Austronesian; Melan: Melanesian; Morob: Morobe; NHebr: 
New Hebrides; Philip: Philippines; MiaoY: Miao-Yao; IndEur: Indo-European; MiZo: Mixe- 
Zoquean; Athap: Athapaskan; ESud: Eastern Sudanic; NilSah: Nilo-Saharan; LoloB: Lolo- 
Burmese; SinTib: Sino-Tibetan; Kamas: Kamasau; GVDani: Grand Valley Dani; GrMad: 
Greater Madang; TurKik: Turama-Kikorian; NHalm: North Halmahera. 


2. Comparing ASJPsim and COGNsim


The number of data points for ASJPsim and COGNsim for individual lan- 
guage pairs is so large that it is unwieldy for visual inspection. However, to 
illustrate an interesting tendency in the data we plot language pairs from five 
different families in figure 1. 


[Figure 1 plot not reproduced; vertical axis: COGNsim, horizontal axis: ASJPsim.]

Figure 1. Scatter-plot of COGNsim as a function of ASJPsim for individual
language pairs pertaining to five selected families: Australian (0),
Uto-Aztecan (9), Japonic (+), Carib (A), Sino-Tibetan (x).


Figure 1 illustrates, for selected data, that different families tend to 
occupy different regions in a scatter-plot of COGNsim and ASJPsim. For 
instance, Australian language pairs tend to stay close to the diagonal, 
whereas Sino-Tibetan language pairs occupy a region where low values for 
ASJPsim correspond to high values for COGNsim. Language pairs from 
other families occupy regions in between. 

Table 2 provides data on the averages of ASJPsim and COGNsim for all
language pairs within each family as well as Pearson's r for the correlation
between ASJPsim and COGNsim across all language pairs belonging to
the family.

Reviewing the second and third columns in table 2 we observe that 
mCOGNsim, the average cognate similarity within families, is always 
greater than mASJPsim, the corresponding average ASJP similarity. This is
because cognates can be less than 100% similar in form: a cognate pair counts
fully toward COGNsim but only partially toward ASJPsim.

Table 2. Data on mean ASJP similarities and mean cognate similarities


Family mASJPsim mCOGNsim r 

AA 12.30 23.13 0.856 
Alt 57.84 75.57 0.600 
An 23.23 31.80 0.726 
AuA 13.82 38.66 0.788 
Aus 19.36 32.00 0.886 
Car 20.42 52.00 0.765 
Dra 23.11 33.13 0.869 
HM 10.67 72.36 0.715 
IE 9.48 24.05 0.921 
Jap 50.70 78.00 0.921 
May 29.14 47.82 0.862 
MGe 29.31 66.56 0.451 
MZ 46.42 66.10 0.806 
NC 7.54 31.92 0.741 
NDe 19.51 51.83 0.739 
NS 6.67 12.53 0.934 
Que 54.63 82.43 0.325 
Sal 11.79 24.56 0.841 
ST 14.62 66.85 0.576 
TK 21.69 62.81 0.788 
TNG 16.95 31.69 0.836 
Tor 63.41 71.52 0.929 
UA 14.29 46.83 0.837 
WP 49.74 75.21 0.648 


In figure 2 we plot the relationship between mCOGNsim and mASJP- 
sim. The dotted line, provided as a point of comparison, intercepts at zero 
and has a slope of 1. The solid line shows the results of a linear regression, 
where r = 0.762 and p < 0.0001. Its slope is 0.93, which is so close to 1 that
the intercept, at 25.96%, can be interpreted as the percentage that roughly 
needs to be added to get from mASJPsim to mCOGNsim. The relatively 
high r and the low p show that ASJPsim and COGNsim are parallel mea- 
sures of similarities among languages. However, we observe a cone-shaped 
distribution of the dots in the chart, with a tremendous amount of variation 
in mCOGNsim for low values of ASJPsim and an increasingly narrower 
concentration around the regression line for high values of ASJPsim. This 
reflects the sort of distribution exemplified in figure 1, where language pairs
in different families form clouds in different regions of the chart, except 
that here (in figure 2) we represent each family as a single data point. In the 
following section we turn to possible explanations for this variability. 
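
The regression of mCOGNsim on mASJPsim reported above is an ordinary least-squares fit. A minimal sketch of such a fit, with a function name of our own choosing, is given here; applied to the mASJPsim and mCOGNsim columns of table 2 it should give a slope, intercept, and r comparable to those reported, although we do not reproduce the exact figures here.

```python
import math


def linear_fit(x, y):
    """Least-squares slope, intercept, and Pearson's r for y regressed on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r
```

A slope near 1 means that the intercept can be read as a roughly constant offset between the two measures, which is the interpretation given in the text.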


[Figure 2 plot not reproduced; vertical axis: mCOGNsim, horizontal axis: mASJPsim.]

Figure 2. Scatter-plot of mCOGNsim against mASJPsim


3. mCOGNsim vs. mASJPsim in relation to segment inventory size 
and word length 


Given that the relation between ASJPsim and COGNsim is partly a “family 
matter” we need to somehow capture the relationship across all language 
pairs pertaining to each family. One way of doing this is simply to take the 
average of the difference between COGNsim and ASJPsim, which we call 
mDIFF. Another, more principled approach involves looking at how the two 
depend on time. Glottochronology (Lees 1953) assumes that the logarithm 
of COGNsim diminishes in proportion to time, and a similar assumption, 
supported by evidence from 52 archaeological, historical and epigraphic 
calibration points, is made for ASJPsim in Holman et al. (2011). Thus, the 
log of COGNsim should be proportional to the log of ASJPsim. A useful way 
of characterizing the relation between COGNsim and ASJPsim, then, is to 
find the average ratio, mRATIO, of log(COGNsim) to log(ASJPsim) across 
language pairs of a given family. 
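
The two family-level summaries just defined can be sketched as follows. The function names are ours. We assume that the logarithms are taken of proportions rather than percentages: this is inferred from the mRATIO values below 1 in table 3 and from the later requirement that ASJPsim be strictly below 100%, not stated explicitly in the text.

```python
import math


def m_diff(pairs):
    """Mean of COGNsim - ASJPsim over (cogn_sim, asjp_sim) pairs, in percent."""
    return sum(c - a for c, a in pairs) / len(pairs)


def m_ratio(pairs):
    """Mean of log(COGNsim) / log(ASJPsim), with similarities as proportions.

    Pairs for which a logarithm is undefined, or for which the denominator
    would be log(1) = 0, are skipped.
    """
    ratios = [math.log(c / 100.0) / math.log(a / 100.0)
              for c, a in pairs
              if 0.0 < c <= 100.0 and 0.0 < a < 100.0]
    if not ratios:
        raise ValueError("no language pair with a defined ratio")
    return sum(ratios) / len(ratios)
```

If both logarithms decline in proportion to time, their ratio for a language pair does not depend on the time depth of that pair, which is what makes mRATIO comparable across families of different ages.
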

It could be the case that the size of phonological inventories affects the
rate of change in segments. If a language has a relatively large number of 
segments the phonetic space occupied by each will be relatively small. In this 
situation a phonetic fluctuation will perhaps more easily cross the phonologi- 
cal boundary of a neighboring segment in the phonetic space, leading to free 
variation, which may eventually lead to articulatorily driven phonological 
change. Perceptually driven changes would also seem to occur with a higher 
probability when the phonetic space is more densely packed, since language 
users would be more prone to misperceive a sound when phonetically similar 
sounds constitute part of the inventory. These hypothetical factors trans- 
late into the testable prediction that the difference between mASJPsim and 
mCOGNsim should be positively correlated with the average number of 
phonological segments found in the group of languages concerned. 

While we do not have access to full segment inventories for the languages 
in our sample, we can use as proxies the number of different segments in the 
ASJP transcription system (or ASJPcode, cf. Brown et al. 2008) found in 
the 40-item word lists. In Wichmann et al. (2011) the number of different 
segments in word lists was used as a proxy for segment inventory sizes in a 
successful way, inasmuch as we were able to confirm well-known correla- 
tions involving segment inventory sizes (Hay and Bauer 2007; Nettle 1995, 
1998) using the mean number of segments represented in word lists (mSR). 

Another set of hypotheses is that shorter words tend to change faster 
phonologically than longer ones or that speakers of a language will exchange
their words for completely new ones with a relatively high speed when 
their language contains relatively long words. In both cases, the difference 
between COGNsim and ASJPsim should be inversely correlated with mean 
word length, because the rate of loss of COGNsim over time would approach 
the rate of loss of ASJPsim when words are longer. Now, these are perhaps 
not the intuitively most plausible hypotheses and we would not have pro- 
duced them had it not been for the fact that either one or the other is strongly 
supported by our data. As a measure of word length, we take the first word 
for each concept in our 40-item lists and average the number of ASJPcode 
segments (if the translation of the concept is phrasal we still only take the 
first word). Finally we average the averages within language families to 
obtain mmWL. 
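
The two word-list statistics used in this section, the number of segments represented (SR) and the mean word length, can be sketched as below. The function names are ours, and the sketch assumes that each character in a transcription is one segment, which is a simplification of ASJPcode. Whether SR is counted over whole phrases or only over first words is not specified above; here all symbols are counted.

```python
def first_word(form: str) -> str:
    """First word of a possibly phrasal translation."""
    return form.split()[0]


def segments_represented(word_list: dict) -> int:
    """Number of distinct symbols found in a {concept: form} word list."""
    return len({ch for form in word_list.values()
                for ch in form if not ch.isspace()})


def mean_word_length(word_list: dict) -> float:
    """Mean number of symbols in the first word given for each concept."""
    lengths = [len(first_word(form))
               for form in word_list.values() if form.strip()]
    return sum(lengths) / len(lengths)


def family_mean(values) -> float:
    """Average of per-language values within a family (mSR or mmWL)."""
    values = list(values)
    return sum(values) / len(values)
```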

Table 3 contains the data needed to test these different predictions. First 
we test whether the difference between mASJPsim and mCOGNsim is 
positively correlated with mSR, and thereby the hypothesis that languages 
change phonologically faster the more phonemes they have. A linear cor- 
relation of mSR and mDIFF indeed shows a positive correlation of r = .21, 
but it is small and non-significant, p = .32. As another way of looking at the
relationship we can test whether mRATIO, i.e., the mean of the ratios of
log(COGNsim) to log(ASJPsim), is negatively correlated with mSR. This
is, indeed, the case, but again the correlation is small and non-significant,
r = -.23, p = .27. Thus, judging from the evidence from language family
averages, the hypothesis that languages change faster phonologically the
more segments they have is not borne out, in spite of the plausible nature
of the hypothesis.

We now go on to test whether differences in word length explain the 
variability in differences between cognate similarities and ASJP similarities 
across language families. Again referring to table 3 we first correlate mDIFF 
and mmWL, which yields a solid r = -.50, p = .01. Again we alternatively test
for mRATIO and find that this property is positively correlated with mmWL,
r = .53, p < .01.
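
The correlations reported in this section are Pearson correlations across the 24 families. A minimal sketch of the computation for mDIFF against mmWL follows, using the two columns of table 3 as we read them. The helper names are ours, and the significance test shown (a t statistic with n - 2 degrees of freedom) is the standard test for a correlation coefficient, which we assume rather than know to be the one used here.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_from_r(r, n):
    """t statistic (n - 2 degrees of freedom) for testing r against zero."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# mDIFF and mmWL columns of table 3, in the family order of that table.
mdiff_col = [10.83, 17.72, 8.57, 24.84, 12.64, 31.58, 10.02, 61.69,
             14.58, 27.30, 18.68, 37.25, 19.68, 24.37, 32.31, 5.87,
             27.80, 12.77, 52.23, 41.12, 14.75, 14.12, 32.54, 25.47]
mmwl_col = [4.06, 3.58, 4.53, 3.65, 5.21, 4.68, 3.96, 3.29,
            3.95, 3.79, 3.66, 3.82, 3.78, 3.73, 3.24, 3.83,
            4.41, 5.06, 3.37, 2.98, 4.27, 4.17, 4.42, 4.63]

r = pearson_r(mmwl_col, mdiff_col)
t = t_from_r(r, len(mdiff_col))
print(f"r = {r:.2f}, t({len(mdiff_col) - 2}) = {t:.2f}")
```

The p value is then obtained by comparing t with Student's t distribution, for instance with a statistics package.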

Somewhat unexpectedly we have found that, judging by averages across 
families, lexical replacement increases as a function of word length or, 
alternatively, phonological change decreases as a function of word length. In 
contrast, lexical replacement and phonological change are not significantly 
affected by the sizes of segment inventories. Later in this paper we discuss
the competing explanations for the correlations involving word length. But 
before that, we would like to establish the findings more firmly by looking at 
the behavior of individual words within language families. 


4. Correlations across items within families 


We have seen in the previous section that language families with longer 
words tend to have fewer cognates relative to their overall lexical similarity 
than do families with shorter words. Does this correlation apply to the words 
themselves or only to the families? This question can be addressed in the 
eleven language groups with checkmarks in table 1, for which the sources of 
the word lists also indicate which words are cognate. With this information it 
is possible to calculate mASJPsim and mCOGNsim separately for individual 
items on the ASJP 40-item list, averaged across language pairs in the group. 
DIFF and RATIO are then defined for each item as mCOGNsim - mASJPsim
and log(mCOGNsim)/log(mASJPsim), respectively. One new property
of items, mAsimC, is defined as mASJPsim calculated only for those pairs 
of words identified as cognate. mAsimC indicates the degree of phonological 
similarity between words that may have undergone phonological change but 
have not undergone lexical replacement. To ensure representative samples, 

Table 3. Data for correlations with numbers of segments and mean word length


Family mDIFF mRATIO mSR mmWL 
AA 10.83 0.644 27.36 4.06 
Alt 17.72 0.545 23.83 3.58 
An 8.57 0.795 20.65 4.53 
AuA 24.84 0.469 25.25 3.65 
Aus 12.64 0.686 18.35 5.21 
Car 31.58 0.407 19.30 4.68 
Dra 10.02 0.746 19.90 3.96 
HM 61.69 0.133 28.33 3.29 
IE 14.58 0.570 26.80 3.95 
Jap 27.30 0.339 20.40 3.79 
May 18.68 0.589 24.96 3.66 
MGe 37.25 0.353 23.44 3.82 
MZ 19.68 0.577 19.50 3.78 
NC 24.37 0.436 24.47 3.73 
NDe 32.31 0.302 27.93 3.24 
NS 5.87 0.775 25.73 3.83 
Que 27.80 0.328 21.56 4.41 
Sal 12.77 0.641 32.67 5.06 
ST 52.23 0.229 27.52 3.37
TK 41.12 0.284 25.00 2.98 
TNG 14.75 0.653 20.49 4.27 
Tor 14.12 0.573 21.86 4.17 
UA 32.54 0.531 20.42 4.42 
WP 25.47 0.467 18.29 4.63 


Legend: mDIFF: the difference between mASJPsim and mCOGNsim; mRATIO: the mean ratio of 
log(COGNsim) to log(ASJPsim); mSR: the mean of the number of different phonological segments in 
the word lists pertaining to the family; mmWL: the average word length within word lists averaged
across the languages of the family.


all these quantities are calculated only for items that are attested in at least 
70% of the languages in the group. Also, mAsimC is defined for an item only 
if mCOGNsim is above 0%, because otherwise there are no cognate pairs for
the item. RATIO is defined only if mCOGNsim is above 0% and mASJPsim
is strictly between 0% and 100%, in order to avoid logarithms of nonpositive
numbers or division by 0. 
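
The item-level quantities and the conditions under which they are defined can be summarized in a short sketch. The function name, the input layout, and the returned dictionary are hypothetical. The sketch assumes that a per-pair ASJP similarity (in percent) and a cognacy judgment are already available for each language pair, and it takes logarithms of proportions, as the stated conditions imply.

```python
import math


def item_measures(pair_records, n_attested, n_languages, min_coverage=0.7):
    """Measures for one item within one language group.

    pair_records: list of (asjp_sim_percent, is_cognate), one per language pair.
    Returns None when the item is attested in too few languages.
    """
    if not pair_records or n_attested / n_languages < min_coverage:
        return None
    m_asjp = sum(s for s, _ in pair_records) / len(pair_records)
    cognate_sims = [s for s, is_cognate in pair_records if is_cognate]
    m_cogn = 100.0 * len(cognate_sims) / len(pair_records)
    result = {"mASJPsim": m_asjp, "mCOGNsim": m_cogn,
              "DIFF": m_cogn - m_asjp, "RATIO": None, "mAsimC": None}
    if m_cogn > 0.0:
        # mAsimC: ASJP similarity restricted to pairs judged cognate.
        result["mAsimC"] = sum(cognate_sims) / len(cognate_sims)
        if 0.0 < m_asjp < 100.0:
            # Both logarithms exist and the denominator is nonzero.
            result["RATIO"] = (math.log(m_cogn / 100.0)
                               / math.log(m_asjp / 100.0))
    return result
```
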

Table 4. Correlations within families


Group     mAs     mCs    DIFF   RATIO  mAsimC
Afras   -.359   -.193   -.369    .415    .180
MoKh    -.370   -.326   -.317    .443   -.269
Morob   -.083    .090   -.135    .209    .010
MiaoY   -.092   -.401    .080   -.080   -.462
Mayan   -.569   -.389   -.495    .609   -.016
MiZo    -.409   -.445   -.092    .288   -.376
Athap   -.342   -.259   -.255    .393   -.166
LoloB   -.063   -.402    .154    .091   -.447
Kadai    .038   -.044    .090   -.076   -.140
Kamas   -.368   -.306   -.160    .157   -.004
Huon    -.267   -.519    .100   -.149   -.119
Mean    -.262   -.290   -.127    .209   -.164
t(10)    4.68    5.12    1.95    2.82    2.65


Legend: mAs: mASJPsim; mCs: mCOGNsim 


Cases where a concept is translated by a phrase, i.e., two or more words 
separated by spaces in the data source, rather than by a single word, are 
treated differently for different purposes. For the estimates of the mean 
length of words used in previous sections only the first word in a phrase 
was counted. This seems appropriate when a word list is used as a random 
sample of words in a language. However, for comparing the way that specific 
concepts are expressed across concepts and languages, as is done in the pres- 
ent section, the whole phrase is counted. When translations exceed a single 
word, the properties mASJPsim and mCOGNsim refer to the entire phrase
throughout this paper. The data used are provided as an online appendix. 

Mean word length is now correlated across lexical concepts with each of 
the similarity properties defined in the first paragraph of this section. That is, 
for each of the 40 items pertaining to the ASJP word lists of each language 
group we determine their average length as well as mASJPsim, mCOGN- 
sim, etc., and then calculate Pearson's r. Table 4 provides the correlations 
for each of the eleven language groups, ordered as in table 1. The table also 
gives the mean correlation across groups and the value of Student's t (with 10
degrees of freedom) for testing whether the mean correlation differs from 0. 
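
The test of whether a mean correlation differs from 0 is a one-sample t test over the eleven group-level correlations. The sketch below, with a function name of our own, applies it to the mAs column of table 4; the absolute value of t is printed because table 4 reports t without sign.

```python
import math


def mean_and_t(values):
    """Mean, one-sample t statistic against 0, and degrees of freedom."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)  # sample variance
    t = mean / math.sqrt(variance / n)
    return mean, t, n - 1


# Correlations of word length with mASJPsim (column mAs of table 4).
m_as = [-0.359, -0.370, -0.083, -0.092, -0.569, -0.409,
        -0.342, -0.063, 0.038, -0.368, -0.267]

mean, t, df = mean_and_t(m_as)
print(f"mean r = {mean:.3f}, |t({df})| = {abs(t):.2f}")
# prints: mean r = -0.262, |t(10)| = 4.68
```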

Most of the individual correlations are weak, possibly because of limited 
variation within language groups. They are collectively quite consistent 
across groups, however, producing significant effects (p < .05) for all but
one measure, namely DIFF. Both measures of similarity, mASJPsim and 
mCOGNsim, show significantly less similarity when items are represented 
by longer words than when they are represented by shorter ones. Since 
similarity is calculated across the same pairs of languages for each item, 
it follows that time depth is the same across items and therefore that items 
represented by longer words are less stable through time than items repre- 
sented by shorter ones. The two relative measures, DIFF and RATIO, are 
consistent with table 3 in showing less lexical similarity relative to pho- 
nological similarity for translational equivalents that have longer words, 
although this effect is significant only for RATIO. Finally, the significantly 
negative mean correlation for mAsimC implies that longer words undergo 
more phonological change even if they are not replaced. It follows that the 
lower lexical similarity of longer words relative to phonological similarity is 
a consequence of more lexical change rather than less phonological change. 
In summary, longer words are more likely to be replaced and more likely to 
change phonologically if they are not replaced. 

The significant negative correlation between word length and stability 
can be extended to the entire ASJP database. For this purpose, mean word 
length is calculated for each item in each language family and then averaged
across families. Stability is defined as in Holman et al. (2008b), except
that their similarity measure is replaced by mASJPsim. More specifically, 
mASJPsim is first determined for each item in each of the language genera 
established by Dryer (1989, 2011), who defines genera as the most inclusive 
groups descended from a common ancestral language spoken within the 
last 3500 to 4000 years. Then stability is equal to the weighted mean of 
mASJPsim across genera, with each genus weighted by the square root of 
the number of language pairs in the genus. The correlation between stability 
and mean word length across the 40 items in the ASJP list proves to be 
substantial, r = -.47, p < .01. The other correlations in table 4 cannot be
extended in this way because judgments of cognacy are not available for 
most language groups. 
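
The weighting scheme used for stability can be illustrated with a short sketch. The function name and the input layout are ours; each genus contributes the item's mASJPsim for that genus together with its number of language pairs.

```python
import math


def stability(genera):
    """Weighted mean of mASJPsim for one item across genera.

    genera: list of (m_asjp_sim, n_language_pairs); each genus is weighted
    by the square root of its number of language pairs.
    """
    weights = [math.sqrt(n_pairs) for _, n_pairs in genera]
    weighted_sum = sum(w * sim for w, (sim, _) in zip(weights, genera))
    return weighted_sum / sum(weights)
```

For example, a genus with 9 language pairs counts three times as much as a genus with a single pair, so `stability([(40.0, 1), (60.0, 9)])` gives 55.0 rather than the unweighted mean of 50.0.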

The negative correlation between word length and stability, whether 
measured by mASJPsim, mCOGNsim, or mAsimC, can be explained by the 
finding of Pagel et al. (2007) that frequency of use is positively correlated 
with stability, given that frequent words tend to be shorter than infrequent 
ones (Zipf 1935). This explanation, however, does not account for the new 
observation, replicated both across languages and across items, that short 
words show less lexical change relative to phonological change. 


5. Discussion and conclusion


It is a central concern to historical linguistics to identify causes for language 
change, and many causes are known to exist. External ones include effects 
of social stratification, lexical adaptation (the creation of new words for 
new concepts), borrowing and other effects of language contact, imperfect 
vertical transmission, etc.; among internal ones we can mention the spread 
throughout the lexicon of sound changes, analogy, grammaticalization, etc. 
The introduction of glottochronology widened the search for factors that 
might affect the rate of language change. Examples of such factors would 
be borrowing or word taboo. To date, however, not a single factor, either 
external or internal to languages, has been identified which systematically 
affects rates of change. Population size is an example of a proposed external 
factor influencing the rate of language change which has not stood up to a 
quantitative scrutiny (Wichmann et al. 2008; Wichmann and Holman 2009). 
The already-mentioned relation between frequency and stability identified 
by Pagel et al. (2007) is an example of language-internal factors regulat- 
ing the rate of change, but does not exemplify a factor that systematically 
predicts that one language changes faster than another. 

Thus, our major finding in this paper, namely that longer words tend to 
be replaced faster than shorter words both within and across languages, is 
unique. One implication of the finding is that critics of glottochronology 
for the first time have a weapon other than case studies to attack the idea 
that lexical change is regular enough to be a useful tool for dating language 
divergence. The weapon is rather blunt, however, since the effect of aver- 
age word length is not overwhelming even if statistically significant, and 
our work on the ASJP dating technique (Holman et al. 2011) still shows a 
high degree of regularity in the decay of ASJPsim over time. In fact, the 
present finding that the effect of word length is stronger for COGNsim than 
for ASJPsim may explain why some studies of glottochronology report less 
regular results than do Holman et al. Thus, the ‘weapon’ may serve practi- 
tioners of lexically-based dating techniques better than their critics since it 
can potentially be used to improve those techniques. 

We have not yet addressed possible explanations for our finding, and can- 
not hope to do so conclusively at this point. One possibly relevant factor is 
the differing information provided by long and short words for judgments 
of cognacy. Maybe false cognates are more likely to be accepted for short 
words. This sort of inaccuracy is less likely to be important if cognacy is 
inferred by means of regular sound correspondences rather than judged by 
inspection. Cognitive biases are even more reduced for ASJPsim, which is
normalized by word length and calculated automatically. 

The other possible explanatory factor is the process of language change 
itself. Why should speakers of languages that have longer words in their basic 
vocabulary replace these words more frequently than speakers of languages 
that have shorter words? The reason is not, for instance, that speakers want 
to replace longer words with shorter ones, because we draw our observations 
from the current state of languages, where the ones that have the longer 
words have replaced earlier words faster. 

Our tentative hypothesis is that if a language has rich word-formation 
strategies at its disposal such that many of the words in a language are formed
by derivation and compounding, then the words in the language will tend to 
be longer and also will tend to be replaced more often. An implication is that 
the creation of complex lexemes is generally preferred over the creation of 
simplex ones. A problem for testing this hypothesis is that the ASJP word 
lists are not based on a consistent definition of what a word is. Generally, the 
word lists simply reflect whatever is given as a translational equivalent for 
each concept in a particular linguistic source, with the exception that tran- 
scribers have stripped off inflectional elements and class markers when their 
knowledge of the languages allowed them to do so. This is a minor caveat, 
since the data can be revisited and adjusted for consistency. A more serious 
problem is that each of the many lexical items used in this study would ideally 
have to be tagged for its status as simplex, derived, compounded, or phrasal 
(and maybe other categories, as well as various combinations) in order to 
provide more substance to our hypothesis. Thus, the further investigation of 
the proposed relation between differences in word formation strategies and 
rates of lexical replacement seems to call for a larger, collective effort and 
several future case studies. 


Online appendix: word lists transcribed in ASJPcode with cognate 
encoding from the literature 


The online appendix is available at 
http://spraakbanken.gu.se/sites/spraakbanken.gu.se/files/cognatedata.zip 


The file contains data used in section 4, i.e., ASJP word lists with informa- 
tion on cognacy, mostly drawn from the literature. The format of the data 
sheet is explained in the file “description.pdf”. 


Appendix: matching of languages in the lexicostatistical literature 
with languages represented in the ASJP database used in this study 


The data in tables 2–3 were produced by comparing published cognate 
percentages with ASJPsim for languages that are represented both in the 
lexicostatistical literature and in the ASJP database (Wichmann et al. 2012), 
where the latter has been updated to include as many of the languages in the 
former as possible and, whenever possible, the actual data on which the cog- 
nate counts were based. Language groups for which ASJPsim is calculated 
from the same dataset which was used for counting cognate percentages 
are identified by a star following the language group name, and when there 
are but a handful of exceptions where data from other sources have been 
added, or when data and cognate judgments are from the same author but 
in different publications, a star is given in parenthesis (note that ASJPsim is 
based on the reduced, 40-item version of the Swadesh list whereas published 
cognate percentages are based on other versions or derivatives, as described 
in section 1 of this paper, so identity of data sources does not mean complete 
identity of data). When authors providing cognate counts do not provide 
the word lists used, alternative data sources are used. All sources for ASJP 
word lists are found at http://lingweb.eva.mpg.de/asjp/index.php/ASJP. The 
doculects represented in our database are uniquely identified by their names. 
In the lists below we provide the language family names (in bold), language 
group names (in bold italics), references to sources (in parentheses), names 
of languages in the sources for cognate counts (in normal font), the ISO 
639-3 identifier, and the ASJP designation (in capital letters). Exceptions to 
these patterns are Iwaidjan (Australian) and Mayan, where cognate judg- 
ments were made directly in relation to ASJP word lists; thus, only the ISO- 
code and the ASJP designation are given in these cases. Language names are 
joined by + in cases where the source gives separate cognate percentages for 
varieties with the same ISO-code, and these percentages are averaged in the 
present calculations. In the ASJP database the language designations make 
use of the underscore character instead of spaces (e.g., HARARI_2). In this 
appendix the underscores are replaced by spaces (e.g., HARARI 2) to allow 
for more line breaks. The information provided in this appendix is intended 
to ensure replicability of our results. 
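
The conventions just described can be read mechanically. The sketch below, in Python, parses the entries of a single language group as they are printed in this appendix and restores the underscores of the database designations. It is only an illustration: the names `Doculect` and `parse_entries` are hypothetical, and it assumes that entries are separated by semicolons, fields by commas, and that words hyphenated at line breaks have already been rejoined.

```python
from typing import List, NamedTuple, Tuple


class Doculect(NamedTuple):
    source_names: Tuple[str, ...]  # names in the cognate-count source; empty if none given
    iso: str                       # ISO 639-3 identifier
    asjp: str                      # ASJP designation with underscores restored


def parse_entries(text: str) -> List[Doculect]:
    """Parse 'Name, iso, ASJP NAME; ...' or 'iso, ASJP NAME; ...' entries."""
    records = []
    for chunk in text.strip().rstrip(".").split(";"):
        fields = [" ".join(f.split()) for f in chunk.split(",")]
        if fields == [""]:
            continue  # tolerate a trailing semicolon
        if len(fields) == 3:
            # Varieties sharing one ISO code are joined by '+'; their
            # cognate percentages would be averaged downstream.
            names = tuple(n.strip() for n in fields[0].split("+"))
            iso, asjp = fields[1], fields[2]
        elif len(fields) == 2:
            # Iwaidjan and Mayan pattern: only ISO code and ASJP designation.
            names = ()
            iso, asjp = fields
        else:
            raise ValueError(f"unexpected entry: {chunk!r}")
        records.append(Doculect(names, iso, asjp.replace(" ", "_")))
    return records


if __name__ == "__main__":
    for rec in parse_entries("Harari, har, HARARI 2; Chaha + Geto, sgw, GETO."):
        print(rec)
```

Keeping the source names as a tuple makes the averaging over `+`-joined varieties explicit in whatever code consumes the records; that step is left out here because it depends on how the published percentages are stored.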


Afro-Asiatic: Afrasian* (Militarev 2000): Lebanese Arabic, apc, ARABIC NORTH 
LEVANTINE; Tigrai, tig, TIGRIGNA; Amharic, amh, AMHARIC 3; Harari, har, 
HARARI 2; Mehri, gdq, MEHRI 2; Jibbali, shv, SHEHRI; Soqotri, sqt, SOQOTRI 2; 
Siwa, siz, SIWI; Ghadames, gha, GHADAMES 2; Qabyle, kab, KABYLE; Ahaggar, 
thv, TAMAHAQ TAHAGGART 2; Zenaga, zen, ZENAGA 2; Hausa, hau, HAUSA 3; 
Bole, bol, BOLE 2; Beja, bej, BEJA 2; Oromo, hae, EASTERN OROMO 2; Dahalo, 
dal, DAHALO 2; Kefa, kbr, KEFA 2. Cushitic* (Bender 1971): Beja, bej, BEJA; 
Bilen, byn, BILIN 2; Qimant, ahg, KEMANT 2; Xamtanga, xan, XAMTANGA 2; 
Awngi, awn, AWNGI 2; Hadiyya, hdy, HADIYYA 2; Libido, liq, LIBIDO; Kembata, 
ktb, KAMBAATA 2; Alaba, alw, ALABA; Sidamo, sid, SIDAMO 2; Derasa, 
drs, GEDEO 2; Burji, bji, BURJI 2; Afar, aar, AFAR 2; Saho, ssy, SAHO 2; Baiso, 
bsw, BAISO 2; Arbore, arv, ARBORE 2; Dasenech, dsh, DAASANACH 2; Somali, 
som, SOMALI 2; Rendille, rel, RENDILLE 2; Mecha, gaz, MECHA OROMO; 
Borena, gax, BORANA OROMO 2; Qottu, hae, EASTERN OROMO; Konso, kxc, 
KOMSO 2; Gidole, gdl, GIDOLE 2; N Bussa, dox, BUSSA 2; Gawwada + Gobeze + 
Werize, gwd, GAWWADA 2; Tsamai, tsb, TSAMAI 2; Iraqw, irk, IRAQW 2. 
Ethiosemitic(*) (Bender 1971): Tigre, tig, TIGRE; Tigrinya, tir, TIGRINYA; Amharic, 
amh, AMHARIC; Argobba, agj, ARGOBBA; Zway, zwa, ZWAY; Walani, stv, 
WALANI SILTE; Harari, har, HARARI; Gafat, gft, GAFAT; Soddo, gru, SODDO; 
Mesmes, mys, MESMES 2; Mesqan, mvz, MESQAN; Chaha + Geto, sgw, GETO; 
Innemor, ior, INNEMOR. Omotic* (Bender 1971): Dime, dim, DIME; Ari, aiw, 
ARI; Banna, amf, BANNA; Maji, mdx, MAJI; Sheko, she, SHEKO; Nao, noz, 
NAO; Southern Mao, myo, SOUTHERN MAO; Shinasha, bwo, SHINASSHA 2; 
Kefa, kbr, KEFA; Mocha, moy, MOCHA 2; Janjero, jnj, JANJERO; Bencho, bcq, 
BENCHO; Male, mdy, MALE ETHIOPIA; Basketo, bst, BASKETO; Welamo, 
wal, WELAMO; Kullo, dwr, KULLO; Dorze, doz, DORZE; Oyda, oyd, OYDA; 
Kacama, kcx, KACHAMA; Koyra, kqy, KOYRA; Zayse + Zergulla, zay, ZAYSE. 


Altaic: Turkic* (Troike 1969): Turkish, tur, TURKISH 2; Azerbaijani (Azeri), azj, 
AZERBAIJANI NORTH 2; Karachai, krc, KARACHAY BALKAR; Crimean 
Tatar, tat, CRIMEAN TATAR; Kazan Tatar, tat, KAZAN TATAR; Misher Tatar, 
tat, MISHER TATAR. 


Australian: Daly* (Tryon 1974): Mullukmulluk, mpb, MULLUKMULLUK; 
Yunggor, zml, YUNGGOR; Kamor, xmu, KAMOR; Marithiel, mfr, MARITHIEL; 
Marityabin, zmj, MARITYABEN; Maridan, zmd, MARIDAN; Maramanadji, 
zmm, MARAMANADJI; Marengar, zmt, MARENGAR; Ami, amy, AMI; Manda, 
zma, MANDA; Pungupungu, wdj, PUNGUPUNGU; Wadyginy, wdj, WADJIGINY; 
Ngangikurr, nam, NGANGIKURRUNGGURR. Iwaidjan* (Robert Mailhammer 
p.c., 2011): amg, AMURDAK; ibd, IWAIDJA; mph, MAWNG. Mayi* (Breen 
1981): Ngawun, nxn, NGAWUN; Mayi-Kulan, mnt, MAYKULAN; Mayi-Yapi, 
mnt, MAYI YAPI; Mayi-Thakurti, mnt, MAYI THAKURTI; Mayi-Kutuna, xmy, 
MAYAGUDUNA. Paman (Sommer 1976): Bariman Gutinhma, zmv, PARIMAN- 
KUTINMA; Umbuygamu, umg, UMBUYKAMU; Lamalama, lby, LAMALAMA 
COASTAL. West Australian (O'Grady 1966): Nyungumarda, nna, NYANGUMARTA; 
Yulbaridja, mpj, YULPARIJA; Warburton Ranges, ntj, NGAANYATJARRA; 
Pandjima, pnw, PANYTYIMA; Jindjibandi, yij, YINDJIBARNDE; 
Ngaluma, nrl, NGALOOMA; Wadjeri, wbv, WAJARRI. West Barkly (Chadwick 
1979): Wambaya, wmb, WAMBAYA; Djingili, jig, DJINGILI; Gudandji, nji, 
GUDANIJI. Worrorran(*) (McGregor and 2009): Wunambal, wub, WUNAMBAL; 
Gunin Kwini, gww, GUNIN/KWINI; Ngarinyin, ung, NGARINYIN; Unggumi, 
unp, UNGGUMI; Bunuba, bck, BUNABA; Gooniyandi, gni, GOONIYANDI; Kija, 
gia, KITJA; Miriwoong, mep, MIRIWUNG; Walmajarri, wmt, WALMAJARRI. 


Austro-Asiatic: Mon-Khmer* (Peiros 1998): Jeh, jeh, JEH; Bahnar, bdq, 
BAHNAR; Chrau, crw, CHRAU; Kui, kdt, KUI THAILAND; Khmer, khm, 
KHMER; Semai, sea, SEMAI; Mon, mnw, MON; Nyakur, cbn, NYAKUR; Viet- 
namese, vie, VIETNAMESE; Ruc, scb, RUC; Wa, wbm, WA; Deang, pce, DEANG; 
Khmu, kjg, KHMU; Ksinmul, puo, KSINMUL; Khasi, kha, KHASI; Mundari, unr, 
MUNDARI. 


Austronesian: Austronesian (Dyen 1965): Banoni, bem, BANONI; Saposa, sps, 
TAIOF; Iai, iai, IAAI; Tongan, ton, TONGAN; Tanna, tnn, NORTH TANNA; 
Fiji, fij, FIJIAN; Zabana, kji, ZABANA; Roviana, rug, ROVIANA; Acira, 
adz, ADZERA; Yapese, yap, YAPESE. Malagasy* (Vérin et al. 1969): Betsileo 
Ambositra, pl, MALAGASY AMBOSITRA; Antaisaka, big, MALAGASY 
ANTAISAKA; Antambahoaka, bjq, MALAGASY ANTAMBAHOAKA; 
Antankarana, xmw, MALAGASY ANTANKARANA; Bara, bhr, MALAGASY 
BARA; Betsimisaraka, bmm, MALAGASY BETSIMISARAKA; Betsileo 
Fianarantsoa, bjg, MALAGASY FIANARANTSOA; Mahafaly, tdx, MALAGASY 
MAHAFALY; Merina, plt, MALAGASY MERINA; Sakalava 1, skg, MALAGASY 
SAKALAVA 1; Sakalava 2, skg, MALAGASY SAKALAVA 2; Sihanaka, plt, 
MALAGASY SIHANAKA; Taimoro, plt, MALAGASY TAIMORO; Antandroy 1, 
tdx, MALAGASY TANDROY 1; Antandroy 2, tdx, MALAGASY TANDROY 2; 
Tsimihety, xmw, MALAGASY TSIMIHETY; Vezo, skg, MALAGASY VEZO; 
Zafisoro, bjq, MALAGASY ZAFISORO. Melanesian* (Z’Graggen 1969): Manam, 
mva, MANAM; Sepa, spb, SEPA; Gedaged, gdd, GEDAGED; Bilbil, brz, BILBIL; 
Takia, tbc, TAKIA; Matukar, mjk, MATUKAR. Morobe* (Hooley 1971): Wagao, 
bzh, WAGAU; Mapos, bzh, MAPOS; Manga, kby, MANGA; Patep, ptp, PATEP; 
Kumaru, ksl, KUMARU; Zenag, zeg, ZENAG; Towangara, goc, TOWANGARA; 
Sambio, tbx, SAMBIO; Dambi, dac, DAMBI; Piu, pix, PIU; Buasi, val, BUASI; 
Latep, zeg, LATEP; Dunguntung, mpl, DUNGUNTUNG; Dangal, mcy, DANGAL; 
Silisili, mpl, SILISILI; Bubwaf, mpl, BUBWAF; Dagin, lbq, DAGIN; Azera, adz, 
AZERA; Wampar, lbq, WAMPAR; Sirak, srf, SIRAK; Guwot, gve, GUWOT; 
Duwet, gve, DUWET; Musom, msu, MUSOM; Sukurum, zsu, SUKURUM; 
Sirasira, zsa, SIRASIRA; Maralango, mcy, MARALANGO; Wampur, waz, 
WAMPUR; Mari, hob, MARI; Onank, una, ONANK; Yaros, adz, YAROS; Amari, 
adz, AMARI; Labu, lbu, LABU; Bukaua, buk, BUKAUA; Kela, kcl, KELA; Kaiwa, 
kbm, KAIWA; Sipoma, sij, SIPOMA; Hote, hot, HOTE; Yamap, ymp, YAMAP; 
Jabem, jae, JABEM; Tami, tmy, TAMI; Malasanga, mqz, MALASANGA; Gitua, 
ggt, GITUA; Lukep, apr, LUKEP; Mangap, mna, MANGAP; Barim, bbv, BARIM; 
Mutu, tuc, MUTU; Tuam, tuc, TUAM; Sio, xsi, SIO; Nengaya, met, NENGAYA; 
Roinji, roe, ROINJI; Arawe, aaw, ARAWE; Maleu, mgl, MALEU; Nakana, nak, 
NAKANA; Halia, hla, HALIA; Gedaged, gdd, GEDAGED. New Hebrides(*) 
(Tryon 1973): Toga (Torres), lht, TOGA; Mosina (Banks), msn, MOSINA; Peterara 
(Maewo), mwo, CENTRAL MAEWO; Nduindui (Aoba), nnd, WEST AMBAE; 
Sakau (Santo), sku, SAKAO; Malo (Santo), mla, NORTH MALO; Fortsenal (Santo), 
frt, FORTSENAL; Raga (Pentecost), lml, RAGA; Sa (Pentecost), sax, SA; Dakaka 
(Ambrym), bpa, DAKAKA BAIAP; Aulua (Malekula), aul, AULUA; Big Nambas 
(Mal), nmb, BIG NAMBAS UNMET; Lewo (Epi), lww, LEWO FILAKARA; 
Nguna (Efate), llp, NORTH EFATE NGUNA; Sie (Erromanga), erg, SIE; Lenakel 
(Tanna), tnl, LENAKEL LENAUKAS. Philippines (Llamzon and Martin 1976): 
Agta, agt, AGTA; Atta, att, ATTA PAMPLONA; Balangaw, blw, BALANGAW; 
Batak, bya, BATAK PALAWAN; Bilaan Koronadal, bpr, BILAAN KORONADAL; 
Bilaan Sarangani, bps, BILAAN SARANGANI; Binukid, bkd, BINUKID; Bontoc, 
bnc, CENTRAL BONTOC; Dumagat, dgc, DUMAGAT CASIGURAN; Gaddang, 
gdg, GADDANG; Amganad Ifugao, ifa, IFUGAO AMGANAD; Batad Ifugao, ifb, 
IFUGAO BATAD; Bayninan Ifugao, ify, IFUGAO BAYNINAN; Ilonggot, ilk, 
ILONGOT KAKIDUGEN; Inibaloi, ibl, INIBALOI; Isneg, isd, ISNEG; Itbayaten, 
ivv, ITBAYATEN BATANES ISLANDS; Itneg, itb, ITNEG BINONGAN; Ivatan, 
ivv, IVATAN BATANES ISLANDS; Kalagan, klg, KALAGAN; Kalinga, knb, 
KALINGA GUINAANG; Kallahan Kayapa, kak, KALLAHAN KAYAPA 
PROPER; Kallahan Keleyqiq, ify, KALLAHAN KELEYQIQ IFUGAO; Kankanay, 
xnn, KANKANAY NORTHERN; Mamanua, mmn, MAMANWA; Ata Manobo, 
atd, MANOBO ATA; Dibabawon Manobo, mbd, MANOBO DIBABAWON; 
Ilianen Manobo, mbi, MANOBO ILIANEN; Kalamsig Manobo, mta, 
MANOBO KALAMANSIG COTABATO; Sarangani Manobo, mbs, MANOBO 
SARANGANI; Tigwa Manobo, mbt, MANOBO TIGWA; Western Bukidnon 
Manobo, mbb, MANOBO WESTERN BUKIDNON; Mansaka, msk, MANSAKA; 
Siasi, sml, SAMAL; Sambal, sbl, SAMBAL BOTOLAN; Sangil, snl, SANGIL 
SARANGANI ISLANDS; Sangir, sxn, SANGIR; Sindangan Subanon, syb, 
SUBANUN SINDANGAN; Siocon Subanon, suc, SUBANON SIOCON; Tboli, tbl, 
TBOLI TAGABILI; Aborlan Tagbanwa, tbw, TAGBANWA ABORLAN; Kalamian 
Tagbanwa, tbk, TAGBANWA KALAMIAN; Tausug, tsg, TAUSUG; Tagalog, tgl, 
TAGALOG; Cebuano, ceb, CEBUANO; Hiligaynon, hil, HILIGAYNON; Waray, 
war, WARAY WARAY; Ilocano, ilo, ILOKANO; Bicol, bcl, CENTRAL BICOLANO; 
Pampango, pam, KAPAMPANGAN; Pangasinan, pag, PANGASINA; 
Tagakaolo, klg, KALAGAN TAGAKAOLO; Yakan, yka, YAKAN; Sibutu, 
ssb, SIBUTU SOUTHERN SAMA; Kapul, abx, INABAKNON; Palun Mapun, 
sjm, MAPUN; Maranao, mrw, MARANAO; Tasaday, mdh, MAGUINDANAO; 
Kiniray’a, krj, KINARAY-A; Masbateño, msb, MASBATENYO; Sorsogonon, 
bks, NORTHERN SORSOGON; Butuanon, btw, BUTUANON; Hanunoo, hnn, 
HANUNOO; Itawes, itv, ITAWIT; Ibanag, ibg, IBANAG; Yogad, yog, YOGAD; 
Aklanon, akl, AKLANON; Capiznon, cps, CAPIZNON; Cagayanzillo, cgc, 
KAGAYANEN; Romblonon, rol, ROMBLOMANON; Tiruray, tiy, TIRURAY; 
Mandaya, mry, MANDAYAN CARAGA. Yapen (Anceaux 1961): Woi, wbw, 
WOI; Pom, pmo, POM; Marau, alu, MARAU; Ansus, and, ANSUS; Papuma, ppm, 
PAPUMA; Munggui, mth, MUNGGUI; Serui Laut, seu, SERUI-LAUT; Ambai, 
amk, AMBAE; Wadapi-Laut, amk, WADAPI LAUT; Wabo, wbb, WABO; Kurudu, 
kjr, KURUDU; Wandamen, wad, WANDAMEN; Dusner, dsn, DUSNER; Ron, rnn, 
RON; Biak, bhw, BIAK; Waropen, wrp, WAROPEN; Mor, mhz, MOR; Irarutu, irh, 
IRARUTU. 


Carib: Cariban (Villalon 1991): Yabarana, yar, YABARANA; Panare, pbh, ENAPA 
WOROMAIPU; Pemon + Kamarakoto + Taurepan, aoc, PEMON; Makushi, mbc, 
MACUSHI; Oayana, way, WAYANA; Carib, car, KALINA; Yukpa, yup, YUKPA; 
Bakairi, bkq, BAKAIRI; Makiritare, mch, MAQUIRITARI; Hianacoto-Umaua, 
cbd, CARIJONA. 


Dravidian: Dravidian(*) (Andronov 2001): Tamil, tam, TAMIL; Malayalam, mal, 
MALAYALAM; Kannada, kan, KANNADA; Telugu, tel, TELUGU; Kolami, kfb, 
NORTHWESTERN KOLAMI; Parji, pci, PARJI; Gondi, ggo, ADILABAD GONDI; 
Kurukh, kru, KURUKH; Malto, mjt, SAURIA PAHARIA; Brahui, brh, BRAHUI. 


Hmong-Mien: Miao-Yao* (Peiros 1998): Hmu, hea, HMU; Xiangxi Hmong (or 
Xx), mmr, XIANGXI HMONG; Hmong Njua, hnj, HMONG NJUA; Bunu, bwx, 
BUNU; She, shx, SHE CHINA; Yao, ium, YAO. 


Indo-European: Indo-European (Dyen et al. 1992): Irish, gle, IRISH GAELIC; 
Welsh, cym, WELSH; Breton, bre, BRETON; Rumanian, ron, ROMANIAN 2; 
Vlach, rup, VLACH; Italian, ita, ITALIAN; French, fra, FRENCH; Provençal, frp, 
ARPITAN; Spanish, spa, SPANISH; Portuguese, por, PORTUGUESE; Catalan, 
cat, CATALAN; German, deu, STANDARD GERMAN; Dutch, nld, DUTCH; 
Afrikaans, afr, AFRIKAANS; Flemish, vls, WESTVLAAMS; Frisian, fry, 
FRISIAN WESTERN; English, eng, ENGLISH; Takitaki, srn, SRANAN TONGO; 
Swedish, swe, SWEDISH; Danish, dan, DANISH; Riksmal, nob, NORWEGIAN 
BOKMAAL; Icelandic, isl, ICELANDIC; Faroese, fao, FAROESE; Lithuanian, lit, 
LITHUANIAN; Latvian, lav, LATVIAN; Lusatian L, dsb, LOWER SORBIAN 2; 
Lusatian U, hsb, UPPER SORBIAN; Czech, ces, CZECH; Slovak, slk, SLOVAK; 
Polish, pol, POLISH; Ukrainian, ukr, UKRAINIAN; Byelorussian, bel, 
BELARUSIAN; Russian, rus, RUSSIAN; Bulgarian, bul, BULGARIAN; Slovenian, 
slv, SLOVENIAN; Serbocroatian, srp, SERBOCROATIAN; Singhalese, sin, 
SINHALA; Kashmiri, kas, KASHMIRI; Lahnda, pnb, WESTERN PANJABI 
SHAHPUR; Marathi, mar, MARATHI; Gujarati, guj, GUJARATI; Panjabi, pan, 
PUNJABI MAJHI; Hindi, hin, HINDI; Bengali, ben, BENGALI; Nepali, nep, 
NEPALI; Ossetic, oss, DIGOR OSSETIAN; Afghan, pbu, NORTHERN PASHTO; 
Waziri, pst, BANNU PASHTO; Wakhi, wbl, CENTRAL GOJAL WAKHI; Persian, 
pes, PERSIAN; Tadzik, tgk, TAJIK; Greek, ell, GREEK; Armenian, hye, 
WESTERN ARMENIAN; Albanian T, als, ALBANIAN TOSK. 


Japonic: Japonic* (Hattori 1961): Tokyo, jpn, TOKYO JAPANESE; Kyoto, jpn, 
JAPANESE KYOTO; Naha, ryu, NAHA; Shuri, ryu, SHURI; Yonamine, xug, 
YONAMINE. 


Mayan: Mayan* (Cecil H. Brown p.c., 2011): cob, CHICOMUCELTEC; ckf, 
SOUTHERN CAKCHIQUEL SAN ANDRES ITZAPA; jai, JACALTEC; qum, 
SIPAKAPENSE; kjb, QANJOBAL SANTA EULALIA; toj, TOJOLABAL; 
usp, USPANTEKO; ctu, CHOL TUMBALA; mhc, MOCHO; poa, POCOMAM 
EASTERN; quc, CENTRAL QUICHE; kek, EASTERN KEKCHI CAHABON; 
ixj, IXIL CHAJUL; mop, MOPAN; quv, SACAPULTECO SACAPULAS 
CENTRO; ttc, TECO TECTITAN; tzj, TZUTUJIL SAN JUAN LA LAGUNA; knj, 
ACATECO SAN MIGUEL ACATAN; caa, CHORTI; mam, MAM NORTHERN; 
pob, POQOMCHI WESTERN; tzb, TZELTAL BACHAJON; agu, AGUACATEC; 
chf, CHONTAL TABASCO; cnm, CHUJ; lac, LACANDON; tzz, ZINACANTAN 
TZOTZIL; hva, HUASTEC; itz, ITZAJ; yua, MAYA YUCATAN. 


Macro-Ge: Ge (Wilbert 1962): Apinaye, apn, APINAYE; Creye, xre, KREYE; 
Canela, ram, APANIEKRA; Craho, xra, KRAHO; Pucobye, gvp, PYKOBJE; 
Suya, suy, SUYA; Cayapo, txu, KAYAPO; Shavante, xav, XAVANTE; Sherente, 
xer, XERENTE. 


Mixe-Zoque: Mixe-Zoque* (Cysouw et al. 2006): North Highland Mixe, mto, 
NORTH HIGHLAND MIXE; South Highland Mixe, mxp, SOUTH HIGHLAND 
MIXE; Lowland Mixe, mco, LOWLAND MIXE; Sayula Popoluca, pos, SAYULA 
POPOLUCA; Oluta Popoluca, plo, OLUTA POPOLUCA; Texistepec Zoque, poq, 
TEXISTEPEC ZOQUE; Soteapan Zoque, poi, SOTEAPAN ZOQUE; Santa Maria 
Chimalapa Zoque, zoh, MARIA CHIMALAPA; San Miguel Chimalapa Zoque, 
zoh, MIGUEL CHIMALAPA; Chiapas Zoque, zoc, CHIAPAS ZOQUE. 


Na-Dene: Athapaskan* (Hoijer 1956): Hare, scs, HARE; Chipewyan, chp, CHIPE- 
WYAN; Beaver, bea, BEAVER; Carrier, crx, CARRIER; Kutchin, gwi, KUTCHIN; 
Sarcee, srs, SARCEE; Galice, gce, GALICE; Kato, ktw, KATO; Mattole, mvb, 
MATTOLE; Hupa, hup, HUPA 2; Navaho, nav, NAVAHO; San Carlos, apw, SAN 
CARLOS; Chiricahua, apm, CHIRICAHUA; Jicarilla, apj, JICARILLA; Lipan, apl, 
LIPAN. 


Niger-Congo: Atlantic (Sapir 1971): Fula, fuc, FULA; Wolof, wol, WOLOF; Serer, 
srr, SERER SINE; Lehar, cae, LEHAR; Safen, sav, SAFEN; Non, snf, NON; Ndut, 
ndv, NDUT FALOR; Fogny, dyo, JOLA; Manjaku, mfv, MANJACA CHURO; 
Papel, pbo, PAPEL; Balanta, ble, BALANTA; Biafada, bif, BIAFADA; Pajade, pbp, 
PAJADE; Nalu, naj, NALU; Bijago, bjg, BIJOGO; Temne, tem, TEMNE; Mmani, 
buy, MMANI; Sherbro, bun, SHERBRO; Krim, krm, KRIM; Kisi, kqs, KISSI; 
Gola, gol, GOLA. Benue-Congo (Bennett and Sterk 1977): Nupe, nup, NUPE; 
Gade, ged, GADE; Igbira, igb, IGBIRRA; Idoma, idu, IDOMA; Eloyi, afo, ELOYI; 
Igbo, ibo, IGBO ONITSHA; Igala, igl, IGALA; Yoruba, yor, YORUBA; Ora, ema, 
EMAI; Bini, bin, EDO; Urhobo, urh, URHOBO; Isoko, iso, ISOKO; Degema, deg, 
DEGEMA 2; Aten, etx, ITEN; Mambila, mzk, MAMBILA; Tiv, tiv, TIV 2; Tunen, 
baz, TUNEN; Jarawa, jar, BANKALA; Bobangi, bni, BOBANGI; Nyanja, nya, 
NYANJA; Kikuyu, kik, GIKUYU; Kwanyama, kua, KWANYAMA. Gur (Swadesh 
et al. 1966): Basal, bud, BASSARI; Konkomba, xon, KONKOMBA 2; Gurma, 
gux, GOURMANCHEMA; Pilapila, pil, YOM; Naudem, nmz, NAWDM; Buli, 
bwu, BULI GHANA; Dagbani, dag, DAGBANI; Mampruli, maw, MAMPRULI; 
Kusal, kus, KUSAL; Hanga, hag, HANGA; Frafra, gur, NINKARE; Moore, mos, 
MOORE; Dagaari, dga, DAGAARE; Vagala, vag, VAGALA; Sisala, sil, SISAALA 
TUMULUNG; Kasem, xsm, KASEM; Lamba, las, LAMA; Kabre, kbp, KABIYE; 
Mambar, myk, MAMARA SENOUFO; Pantera + Fantera, nfr, NAFAARA. Kwa 
(Heine 1968): Twi, aka, TWI ASANTE; Logba, lgq, IKPANA; Adele, ade, ADELE; 
Lipke, lip, LIKPE; Santroko, snw, SELE; Akpufu, akp, AKPAFU; Lelemi, lef, 
BUEM LELEMI; Avatime, avn, SIDEME; Nyangbo, nyb, TUTRUGBU; Bowili, 
bov, TUWULI; Ahlo, ahl, AHLO; Animere, anf, ANIMERE; Ewe, ewe, EWE 
ADANGBE. 


Nilo-Saharan: Eastern Sudanic (Thelwall 1981): Meidob, mei, MEIDOB 
NUBIAN; Debri, dil, DILLING; Dongolawi, kzh, NUBIAN OF DONGOLA; 
Nobiin, fia, NOBIIN; Gaam, tbi, INGASSANA; Liguri, liu, LOGORIK; Shatt, shj, 
SHATT; Nyala + Lagowa, daj, NYALA; Sila, dau, SILA; Temein, teq, TEMEIN; 
Dinka, dik, REK; Shilluk, shk, SHILLUK. Nilo-Saharan(*) (Bender 1971): Nuer, 
nus, NUER; Anyuak, anu, ANYUAK; Shilluk, shk, SHILLUK; Jumjum, jum, 
JUMJUM; Mabaan, mfz, MABAAN; Burun, bdi, BURUN; Inyangatom, nnj, 
NYANGATOM 2; Tirma, suq, TIRMA; Mursi, muz, MURSI; Meen, mym, MEEN; 
Kwegu, xwg, KWEGU; Zilmamu, koe, BAALE; Murle, mur, MURLE; Mesengo, 
mpe, MESENGO; Nara, nrb, NARA; Ingassana, tbi, INGASSANA; Kunama, 
kun, KUNAMA; Wetawit, wti, WETAWIT; Uduk, udu, UDUK; C. Koma, xom, 
CENTRAL KOMA; Langa, lgn, LANGA; N. Koma, kmq, GWAMA; Gumuz, guk, 
GUMUZ. Southern Luo(*) (Blount and Curley 1970): Lango, laj, LANGO; Acholi, 
ach, ACHOLI 2; Alur, alz, ALUR; Luo, luo, LUO; Shilluk, shk, SHILLUK. 


Quechuan: Quechua (Torero 1970): Corongo, qwa, YANAC; Caras, qwh, QUE- 
CHUA HUAYLAS ANCASH; Tarma, qen, QUECHUA NORTH JININ; Ferreñafe, 
quf, INKAWASI; Cajamarca, qvc, CHETILLA; Chachapoyas, quk, QUECHUA 
CHACHAPOYAS; Ayachuco, quy, QUECHUA AYACUCHO; Cuzco, quz, QUECHUA 
DE CUSCO; Potosí + Chuquisaca, quh, MARAGUA. 


Salishan: Salish (Swadesh 1950): Bella Coola, blc, BELLA COOLA; Comox, 
coo, SLIAMMON; Seshelt, sec, SECHELT; Fraser + Nanaimo, hur, COWICHAN; 
Squamish, squ, SQUAMISH; Lkungen + Lummi, str, SALISH STRAITS; Clallam, 
clm, CLALLAM; Nootsak, nok, NOOKSACK; Twana, twa, TWANA; Cowlitz, 
cow, COWLITZ; Chehalis + Satsop, cjh, CHEHALIS UPPER; Quinault, qun, 
QUINAULT; Tillamook, til, TILLAMOOK; Lillooet, lil, LILLOOET; Thompson, 
thp, THOMPSON; Shuswap, shs, SHUSWAP; Okanagan, oka, OKANAGAN 
COLVILLE; Spokane, spo, SPOKANE; Kalispel + Pend d’Oreille, fla, KALISPEL- 
PEND DOREILLE; Columbia, col, COLUMBIA WENATCHI; Coeur d'Aléne, 
crd, COEUR DALENE. 


Sino-Tibetan: Chinese (Xu 1991): Xiamen, nan, AMOY MINNAN CHINESE; 
Meixian, hak, HAKKA; Guangzhou, yue, CANTONESE; Changsha, hsn, XIANG; 
Suzhou, wuu, SUZHOU WU; Beijing, cmn, MANDARIN 2. Lolo-Burmese* 
(Peiros 1998): Burmese, mya, BURMESE; Zaiwa, atb, ZAIWA; Achang, can, 
ACHANG; Nusu, nuf, NUSU; Akha, ahk, AKHA; Biyue, byo, BIYUE; Lahu, 
lhu, LAHU; Jino, jiu, JINO; Mpi, mpz, MPI; Bisu, bzi, BISU; Xide, iii, XIDE; 
Dafang, yig, DAFANG; Nanjiang, ywt, NANJIANG; Lisu, lis, LISU; Naxi, nbf, 
NAXI. Sino-Tibetan (Benedict 1976): Burmese, mya, BURMESE; Tibetan, bod, 
TIBETAN LHASA; Lushai, lus, LUSHAI; Kachin, kac, JINGPHO; Garo, grt, 
GARO; Mandarin, cmn, MANDARIN 2. 


Tai-Kadai: Kadai* (Peiros 1998): Siamese, tha, SIAMESE; Longzhou, zzj, 
ZHUANG SOUTHERN; Zhuang, zyb, ZHUANG NORTHERN; Saek, skb, SAEK; 
Ong Be, onb, ONG BE; Lakkja, lbc, LAKKJA; Mulao, mlm, MULAO; Kam, kmc, 
SOUTHERN DONG; Maonan, mmd, MAONAN; Sui, swi, SUI. 


Torricelli: Kamasau* (Sanders and Sanders 1980): Tring, kms, TRING; Wau, kms, 
WAU; Kamasau, kms, KAMASAU; Yibab, kms, YIBAB; Wandomi, kms, WANDOMI; 
Kenyari, kms, KENYARI; Paruwa, kms, PARUWA; Samap, kms, SAMAP. 


Trans-New Guinea: Angan* (Lloyd 1973): Angaataha, agm, ANGAATAHA; 
Ankave, aak, ANKAVE; Ampale, apz, AMPALE; Baruya, byr, BARUYA 2; 
Ivori, ago, IVORI; Kamasa, klp, KAMASA; Kapau, hmt, KAPAU; Kawacha, 
kcb, KAWACHA; Lohiki, miw, LOHIKI; Menya, mcr, MENYA 2; Simbari, smb, 
SIMBARI; Yagwoia, ygw, YAGWOIA. Awyu(*) (Voorhoeve 1968): Aghu, ahh, 
AGHU; Kaeti, bwp, KAETI; Pisa, psa, PISA; Syiagha, aws, SIAGHA; Yenimu, 
awy, YENIMU; Wambon, wms, WAMBON. Bosavi (Shaw 1986): Duna, duc, 
DUNA; Bimin, bhl, BIMIN; Bogaia, boq, BOGAYA; Pare, ppt, PARE; Agala, agl, 
AGALA; Kubo, jko, KUBO; Samo, smq, SAMO; Bibo, goi, GEBUSI; Honibo, 
goi, HONIBO; Oibae, goi, OIBAE; Kalamo, kkc, ODOODEE; Bedamini, beo, 
BEDAMINI; Etoro, etr, ETORO; Onabasulu, onn, ONABASULU; Kaluli, bco, 
KALULI 2; Sunia, sig, SUNIA; Kasua, khs, KASUA 2; Aimele, ail, AIMELE; 
Kamula, xla, KAMULA; Bainapi, dby, DIBIYASO; Namumi, faa, NAMUMI; 
Bamu, bcf, BAMU 2. Eleman* (Brown 1973): Aheave, xeu, AHEAVE; Kaipi, 
oro, KAIPI; Keuru, xeu, KEURU; Opao, opo, OPAO; Orokolo, oro, OROKOLO 2; 
Sepoe, tqo, SEPOE; Toaripi, tqo, TOARIPI 2; Uaripi, uar, UARIPI. Finisterre 
(Claasen and McElhanon 1970): Nankina, nnk, NANKINA; Awara, awx, AWARA; 
Wantoat, wnc, WANTOAT; Nek, nif, NEK; Yabong, ybo, YABONG; Saep, spd, 
SAEP; Ganglau, ggl, GANGLAU; Kolom, klm, KOLOM; Suroi, ssd, SUROI; 
Lemio, lei, LEMIO; Usino, urw, USINO; Sinsauru, snz, SINSAURU. Grand 
Valley Dani* (Bromley 1967): Upper Pyramid, dni, UPPER PYRAMID DANI; 
Pyramid-Wodo, wlw, PYRAMID WODO; Mid-Grand Valley, dnt, MID GRAND 
VALLEY DANI; Lower Valley Hitigama, dni, HITIGIMA DANI; Lower Valley 
Tangma, dni, TANGMA DANI; Jalimo Angguruk, yli, ANGGURUK YALI; 
Kiniageima Amo, wul, KINIAGEIMA. Greater Madang* (Z’Graggen 1969): 
Isebe, igo, ISEBE; Bau, bbd, BAU; Amele, aey, AMELE; Garus, gyb, GARUS; 
Yoidik, ydk, YOIDIK; Rempi, rmp, REMPI; Garuh, gaw, GARUH; Foran, fad, 
KAMBA; Mawan, mcz, MAWAN; Utu, utu, UTU; Saruga, sra, SARUGA; Kare, 
kmf, KARE; Usino, urw, USINO; Sumau, six, SUMAU; Urigina, urg, URIGINA; 
Korak, koz, KORAK; Waskia, wsk, WASKIA; Malas, mkr, MALAS; Bunabun, 
buq, BUNABUN; Dimir, dmc, DIMIR; Pay, ped, PAY; Pila, sks, PILA; Saki, sks, 
SAKI; Tani, pla, TANI; Ulingan, mhl, ULINGAN; Bepour, bie, BEPOUR; Mawak, 
mjj, MAWAK; Musar, mmi, MUSAR; Wanambre, wnb, WANAMBRE; Wanuma, 
wnu, WANUMA; Yaben, ybm, YABEN; Parawen, prw, PARAWEN; Amaimon, ali, 
AMAIMON; Moresada, msx, MORESADA; Ikundun, imi, IKUNDUN; Pondoma, 
pda, PONDOMA; Wanambre, wnb, WANAMBRE; Katiati, kqa, KATIATI; Osum, 
omo, OSUM; Atemple, ate, ATEMPLE; Angaua, anh, ANGAUA; Emerum, ena, 
EMERUM; Musak, mmg, MUSAK; Paynamar, pmr, PAYNAMAR; Kaian, kct, 
KAIAN; Gamei, gai, GAMEI; Mikarew, msy, MIKAREW MAKARUB; Anor, 
anj, ANOR; Rao, rao, RAO; Banaro, byz, BANARO. Huon* (McElhanon 1967): 
Kâte, kmg, KATE; Dedua, ded, DEDUA; Mape, mlh, MAPE 2; Hube, kgf, HUBE; 
Tobo, tbv, TOBO; Kosorong, ksr, KOSORONG; Mindik, bmu, MINDIK; Burum, 
bmu, BURUM; Ono, ons, ONO; Komba, kpf, KOMBA; Selepet, spl, SELEPET; 
Timbe, tim, TIMBE; Nabak, naf, NABAK; Momolili, mci, MOMOLILI. Kiwaian 
(Wurm 1973): Wabuda, kmx, WABUDA; Middle Bamu Kiwai, bcf, BAMU; 
Morigi, mdb, MORIGI; Kerewo, kxz, KEREWO; Urama, kiw, URAMA; Gope, 
kiw, GOPE; Gibaio, kiw, GIBAIO; Arigibi / Anigibo / Anigibi / Ani, kiw, 
ANIGIBI. Koiarian (Dutton 1969): Koita, kqi, KOITA; Koiari, kbk, KOIARI 2; 
MtnKoiari, kpx, MOUNTAIN KOIARI; Aomie, aom, AOMIE; Barai, bbb, BARAI; 
Managalasi, mcq, ESE MANAGALASI. Kolopom (Voorhoeve 1968): Kimaghana, 
kig, KIMAGHAMA; Riantana, ran, RIANTANA; Ndom, nqm, NDOM. Ok 
(Voorhoeve 1968): Asmat, cns, ASMAT CENTRAL; Telefol, tlf, TELEFOL; Kati, 
yon, NORTH KATI; Aghu, ahh, AGHU; Mombum, mso, MOMBUN. 
Turama-Kikorian (Franklin 1973): Omati, mgx, OMATI; Ikobi, meb, IKOBI; Mena, meb, 
MENA; Kairi, klq, RUMU. 


Uto-Aztecan: Uto-Aztecan (Miller 1984, Cortina-Borja and Valiñas 1989): Northern 
Paiute, pao, NORTHERN PAIUTE; Panamint, par, PANAMINT; Shoshoni, 
shh, SHOSHONI; Comanche, com, COMANCHE; Kawaiisu, xaw, KAWAIISU; 
Southern Paiute, ute, SOUTHERN PAIUTE; Ute, ute, UTE 2; Tübatulabal, tub, 
TUBATULABAL; Cahuilla, chl, CAHUILLA; Cupeno, cup, CUPENO; Luiseno, 
lui, LUISENO; Hopi, hop, HOPI; Papago, ood, TOHONO OODHAM; Nevome, 
ood, UPPER PIMA; Northern Tepehuan, ntp, NORTHERN TEPEHUAN; Guarijio, 
var, WARIHIO; Tarahumara, tar, CENTRAL TARAHUMARA; Opata + Eudeve, 
opt, OPATA; Mayo, mfy, MAYO; Yaqui, yaq, YAQUI; Tubar, tbu, TUBAR; 
Huichol, hch, HUICHOL; Cora, crn, EL NAYAR CORA; Tetelcingo Nahuatl, nhg, 
TETELCINGO NAHUATL; Zacapoaxtla Nahuatl, azz, HIGHLAND PUEBLA 
NAHUATL; Pipil, ppl, PIPIL. 


West Papuan: North Halmahera (Chlenov 1986): Loda, loa, LODA; Galela, gbi, 
GALELA; Tobelo, tlb, TOBELO; Tabaru / Tobaru, tby, TABARU; Pagu / Isam, pgu, 
PAGU; Madole / Modole, mgo, MADOLE; Sahu, saj, SAHU; Tidore, tvo, TIDORE. 
Yawa* (Jones 1986): Tindaret, yva, TINDARET; Ambaidiru, yva, AMBADAIRU; 
Ariepi, yva, ARIEPI; Sarawandori, yva, SARAWANDORI; Konti Unai, yva, 
KONTI UNAI; Mariadei, yva, MARIADEI. 


Notes 


1. We thank Cecil H. Brown and Robert Mailhammer for providing cognate judg- 
ments for Mayan and Iwaidjan, respectively. An earlier version of this paper was 
presented at the ICHL in Osaka, July 2011. We are grateful to Claire Bowern for 
highly useful comments on that occasion. 

2. The normalization leading to LDN is argued by Serva and Petroni (2008) to 
be absolutely necessary for arriving at good results for language classification. 
To our knowledge, other types of normalization have not been tested for string 
comparisons involving languages that are not closely related, analyses having 
been limited to the field of dialectology (Heeringa 2005). This is a potentially 
interesting item for future research. 

3. This variation in lists would be expected if anything to increase the variability 
of COGNsim relative to ASJPsim, which is always based on the same list. The 
additional variability would tend to weaken the observed correlations, thus 
rendering our tests conservative. 

4. An exception is the name ‘Greater Madang’, which we use as a collective term 
for the following Trans-New Guinea groups included in Z’Graggen (1969): 
Madang, Isumrud, Kaukombaran, Mawamuan, Pihom, Josephstaal, and 
Wanang. 

5. An exception where we provide cognate judgments ourselves is the Torricelli 
group Kamasau. This is a set of dialects for which it requires no special expertise 
to distinguish cognates from non-cognates. In the case of Mixe-Zoque we used 
the judgments of Cysouw et al. (2006), but corrected for some typos in that paper. 

6. Not surprisingly, the magnitude of the correlation between COGNsim and 
ASJPsim is greater when both measures are based on the same data set. In the 
appendix we indicate for each language group whether the data sets for the two 
measures are the same, almost so, or not. Families containing language groups 
where the data sets are all the same or almost so include AA, Alt, AuA, Dra, 
HM, Jap, May, MZ, NDe, TK, and Tor. The average r for these families is .81. 
Families containing groups where the data sets are all different include Car, 
IE, MGe, NC, Que, Sal, and UA. The average r for these is .70. The families 
Aus, NS, ST, TNG, and WP are represented by language groups for which the 
sources are mixed, some data sets being the same or almost so and some not. 
Average r for these families is .77. 

7. It was observed by a referee that if mmWL correlates significantly with both 
mDIFF and mRATIO, while mSR does not, then mSR and mmWL perhaps 
do not correlate significantly, which would run counter to previous observa- 
tions about an inverse relationship between word length and segment inventory 
sizes (Nettle 1995, 1998; Wichmann et al. 2011). The data used in the present 
paper are limited, so differences from the results of Wichmann et al. (2011) with 
regard to word length and segment inventory size are entirely expected and not 
particularly telling. The correlation in the present data is r = −.38, p = .07. These 
figures depend highly on the small number of data points, as can be seen by an 
increase in the correlation to r = −.74, p < .0001 with the removal of a single 
outlier, Salishan. 
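
The sensitivity to a single data point described in note 7 can be examined by recomputing Pearson's r with each data point left out in turn. The Python sketch below shows one way to do this. The function names are hypothetical, the numbers in the usage example are invented for illustration and are not the data of this paper, and no p-values are computed.

```python
import math
from typing import List, Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def leave_one_out_r(xs: Sequence[float], ys: Sequence[float]) -> List[float]:
    """r recomputed n times, each time omitting the i-th data point."""
    xs, ys = list(xs), list(ys)
    return [pearson_r(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
            for i in range(len(xs))]


if __name__ == "__main__":
    # Invented values: four points on a falling line plus one outlier.
    x = [1, 2, 3, 4, 10]
    y = [4, 3, 2, 1, 10]
    print(pearson_r(x, y))        # positive because of the outlier
    print(leave_one_out_r(x, y))  # the last entry omits the outlier
```

A large gap between the full-sample r and one of the leave-one-out values flags the corresponding point as influential, which is the situation reported for Salishan in note 7.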